ABSTRACT 

Several research works have focused on supporting index access in 
MapReduce systems. These works have allowed users to signifi- 
cantly speed up selective MapReduce jobs by orders of magnitude. 
However, all these proposals require users to create indexes up- 
front, which might be a difficult task in certain applications (such 
as in scientific and social applications) where workloads are evolv- 
ing or hard to predict. To overcome this problem, we propose LIAH 
(Lazy Indexing and Adaptivity in Hadoop), a parallel, adaptive ap- 
proach for indexing at minimal costs for MapReduce systems. The 
main idea of LIAH is to automatically and incrementally adapt 
to users' workloads by creating clustered indexes on HDFS data 
blocks as a byproduct of executing MapReduce jobs. Besides dis- 
tributing indexing efforts over multiple computing nodes, LIAH 
also parallelises indexing with both map tasks computation and 
disk I/O. All this happens without any additional data copy in main memory 
and with minimal synchronisation. The beauty of LIAH is that it 
piggybacks index creation on map tasks, which read relevant data 
from disk to main memory anyways. Hence, LIAH does not intro- 
duce any additional read I/O-costs and exploits free CPU cycles. As 
a result and in contrast to existing adaptive indexing works, LIAH 
has a very low (or invisible) indexing overhead, usually for the very 
first job. Still, LIAH can quickly converge to a complete index, 
i.e. all HDFS data blocks are indexed. In particular, LIAH can trade 
off early job runtime improvements against fast complete index conver- 
gence. We compare LIAH with HAIL, a state-of-the-art indexing 
technique, as well as with standard Hadoop with respect to indexing 
overhead and workload performance. In terms of indexing over- 
head, LIAH can completely index a dataset as a byproduct of only 
four MapReduce jobs while incurring a low overhead of 11% over 
HAIL for the very first MapReduce job only. In terms of workload 
performance, our results show that LIAH outperforms Hadoop by 
up to a factor of 52 and HAIL by up to a factor of 24. 

1. INTRODUCTION 

In recent years, a huge number of research works have focused 
on improving the performance of Hadoop MapReduce [10, 12, 18, 
22, 24]. In particular, several researchers have focused on sup- 
porting efficient index access in Hadoop [28, 25, 23]. Some of 
these works have improved the performance of selective MapRe- 
duce jobs by orders of magnitude. However, all these indexing 
approaches have three main weaknesses. First, they require a high 
upfront cost or long idle times for index creation. Second, they 
can support only one physical sort order (and hence one clustered 
index) per dataset. Third, they require users to have a good knowl- 
edge of the workload in order to choose the indexes to create. 

Recently, we proposed HAIL [13] (Hadoop Aggressive Indexing 
Library) to solve the first two problems, i.e. high upfront index- 
ing costs and lack of support for multiple sort orders. HAIL allows 
users to create multiple clustered indexes at upload time almost for 
free. As a result, users can speed up their MapReduce jobs by 
almost two orders of magnitude. But, this improvement only hap- 
pens if users create the right indexes when uploading their datasets 
to HDFS. This means that, like traditional indexing techniques [15, 
8, 1, 6, 9, 28, 25, 12, 23], HAIL requires users to decide upfront 
which indexes to create. Thus, HAIL as well as traditional index- 
ing techniques are not suitable for unpredictable or ever-evolving 
workloads [11]. In such scenarios, users often do not know which 
indexes to create beforehand. Scientific applications and social net- 
works are clear examples of such use-cases [3]. 

1.1 Motivation 

Let us see through the eyes of a group of scientists, say Alice and 
her colleagues, who want to analyse their daily experimental results 
using Hadoop MapReduce. Basically, the experimental results are 
collected in a large dataset (typically in the order of terabytes) con- 
taining many dozens of numeric attributes. To understand and in- 
terpret the experimental results, Alice and her colleagues navigate 
through the dataset according to the properties and correlations of 
the data [3]. The problem is that Alice and her colleagues typi- 
cally: (i) do not know the data access patterns in advance; (ii) have 
different interests and hence cannot agree upon common selection 
criteria at data upload time; (iii) even if they agree which attributes 
to index at data upload time, they might end up filtering records ac- 
cording to values on different attributes. Therefore, HAIL (as well 
as traditional indexing techniques) cannot help Alice and her col- 
leagues, because HAIL is still a static system that cannot adapt to 
changes in query workloads. 

One day Alice hears about adaptive indexing [19], where the 
general idea is to create indexes as a side-effect of query process- 
ing. Adaptive indexing aims at creating indexes incrementally in 
order to avoid high upfront index creation times. Alice is excited 
about the adaptive indexing idea since this could solve her (and 
her colleagues') problem. However, Alice notices that she cannot 
apply existing adaptive indexing works [14, 19, 20, 16, 21, 17] in 
MapReduce systems for several reasons: 



First, these techniques aim at converging to a global index for an 
entire attribute, which requires sorting the attribute globally. There- 
fore, these techniques perform many data movements across the 
entire dataset. Doing this in MapReduce would hurt fault-tolerance 
as well as the performance of MapReduce jobs. This is because we 
would have to move data across HDFS data blocks¹ in sync with 
all their three physical data block replicas. 

Second, even if Alice applied existing adaptive indexing tech- 
niques inside data blocks, these techniques would end up in many 
costly I/O operations to move data on disk. This is because these 
techniques consider main-memory systems and thus do not factor 
in the I/O-cost for reading/writing data from/to disk. Only one 
of these works [16] proposes an adaptive merging technique for 
disk-based systems. However, applying this technique inside a data 
block would not make sense in MapReduce since data blocks are 
typically loaded entirely into main memory anyways when process- 
ing map tasks. One may think about applying adaptive merging 
across data blocks, but this would again hurt fault-tolerance and 
the performance of MapReduce jobs as described above. 

Third, these works focus on creating unclustered indexes in 
the first place and hence are only beneficial for highly selective 
queries. One of these works [20] introduced lazy tuple reorgan- 
isation in order to converge to clustered indexes. However, this 
technique needs several thousand queries to converge and its appli- 
cation in a disk-based system would again introduce a huge number 
of expensive I/O operations. 

Fourth, existing adaptive indexing approaches were mainly de- 
signed for single-node DBMSs. Therefore, applying these works 
in a distributed parallel system, like Hadoop MapReduce, would 
not fully exploit the existing parallelism to distribute the indexing 
effort across several computing nodes. 

1.2 Idea 

We propose LIAH (Lazy Indexing and Adaptivity in Hadoop): 
a lazy and adaptive indexing approach for parallel disk-based sys- 
tems, such as MapReduce. The main idea behind LIAH is to exploit 
the existing MapReduce pipeline in order to build clustered indexes 
in a scalable, automatic, and almost invisible way as a byproduct of 
job executions. For this, LIAH interprets incoming jobs as hints 
about what might be a worthwhile index. A salient feature of LIAH 
is that it piggybacks on job execution in such a way that no ad- 
ditional read I/O-cost is required for indexing purposes. In other 
words, LIAH only requires some additional I/O-cost for writing 
clustered indexes back to disk. This allows LIAH to quickly con- 
verge to a complete index, i.e. all HDFS data blocks are indexed, 
with a very low indexing overhead. Another interesting feature is 
that LIAH stores a clustered index created at query processing time 
in an additional HDFS file, called pseudo data block replica. In 
fact, a pseudo data block replica is another logically indexed replica 
for a given data block. Therefore, pseudo data block replicas allow 
us to support a different number of replicas for each data block, 
which is crucial for incremental indexing. 

Like existing adaptive indexing works, LIAH distributes the in- 
dexing effort over several queries to avoid negatively impacting the 
performance of an individual query. However, LIAH differs from 
existing adaptive indexing works in four major aspects: 

First, LIAH focuses on block level clustered indexes. This means 
that LIAH creates one clustered index for each data block instead of 
a single index per attribute. Consequently, LIAH reorders data only 
inside a data block, which preserves the fault-tolerance of Hadoop 
since data is never shuffled across data blocks. Second, LIAH paral- 
lelises the adaptive indexing effort across several computing nodes 
in order to limit the indexing overhead. Third, LIAH considers 
disk-based systems and hence it factors in the cost of reading from 
and writing to disk. LIAH completely sorts a data block once it 
reads the block from disk. This avoids future expensive I/O oper- 
ations for refining the index inside a data block. Still, LIAH does 
not sort all data blocks in one pass, so that no single 
MapReduce job pays the high cost of index creation. Notice that 
adaptive merging [16], which also considers disk-based systems, 
is orthogonal to the focus of LIAH. While LIAH produces a set of 
sorted partitions incrementally, adaptive merging aims at incremen- 
tally combining such sorted partitions. Fourth, LIAH creates clus- 
tered indexes rather than unclustered indexes. This allows LIAH to 
benefit from index scans even for jobs with low selectivity. Since LIAH 
stores datasets in PAX representation [2], creating clustered indexes 
also allows LIAH to avoid expensive random read I/O operations 
for tuple reconstruction. 

¹ Henceforth, we refer to an HDFS data block simply as data block. 

1.3 Research Challenges 

The approach followed by LIAH triggers a number of interesting 
research challenges: 

(1.) How can we change the job execution pipeline to create clus- 
tered indexes at job execution time? How to index big data incre- 
mentally in a disk-based system? How to minimise the impact of 
indexing on job execution times? How to efficiently interleave data 
processing with indexing? How to create several clustered indexes 
for read-only data blocks at query time? How to support a differ- 
ent number of replicas per data block? How will the job execution 
pipeline change for Alice and her colleagues? 

(2.) How can we change Hadoop to exploit newly created clus- 
tered indexes? How to distribute the indexing effort efficiently by 
considering data-locality and index placement across computing 
nodes? How to schedule map tasks to efficiently process indexed 
and non-indexed data blocks without affecting failover? How will 
jobs change from the perspective of Alice and her colleagues? 

1.4 Contributions 

We present LIAH, a lazy and adaptive indexing approach for 
MapReduce systems. The main goal of LIAH is to minimise the 
impact of indexing on job execution times. We make the following 
four contributions: 

(1.) We show how to effectively piggyback adaptive index creation 
on the existing MapReduce job execution pipeline. In particular, 
we show how to parallelise indexing with both the computation of 
map tasks and disk I/O. All this happens without any additional data copy 
in main memory and with minimal synchronisation. A particularity of 
our approach is that we always index a data block entirely, i.e. in a 
single pass. As a result, LIAH not only allows map tasks of future 
jobs to perform an index access, but it also frees them from costly 
extra I/O operations for refining indexes. 

(2.) We show how to efficiently process pseudo data block replicas, 
i.e. data block replicas containing a clustered index adaptively cre- 
ated by LIAH. The beauty of our approach is that it is completely 
invisible from the users' perspective. LIAH takes care of perform- 
ing MapReduce jobs using normal data block replicas or pseudo 
data block replicas (or even both). Additionally, LIAH comes with 
its own scheduling policy, called LIAH Scheduling. The idea of 
LIAH Scheduling is to balance the indexing effort across comput- 
ing nodes so as to limit the impact of indexing on job runtime. As a 
side effect of balancing the indexing effort, LIAH improves parallel 
index access for future jobs as indexes are balanced across nodes. 



(3.) We propose a set of indexing strategies that makes LIAH aware 
of the performance and the selectivity of MapReduce jobs. We 
first present eager adaptive indexing, a technique that allows LIAH 
to quickly adapt to changes in users' workloads at a low index- 
ing overhead. In particular, eager adaptive indexing allows LIAH 
to trade off early job runtime improvements against fast complete index 
convergence. Next, we show how LIAH can decide which data 
blocks to index based on the selectivities of jobs. Then, we present 
the invisible projection technique that allows LIAH to efficiently 
create clustered indexes for jobs having different attribute projec- 
tions. Additionally, in Appendix A, we present a lazy projection 
technique that allows LIAH to integrate an attribute into a clustered 
index only when the attribute is accessed by a job. 

(4.) We present an extensive experimental comparison of LIAH 
with Hadoop and HAIL [13]. We use two clusters, each having 
different types of CPU. A series of experiments shows the superi- 
ority of LIAH over both Hadoop and HAIL. In particular, our ex- 
perimental results demonstrate that LIAH quickly adapts to query 
workloads with a negligible indexing overhead. Our results also 
show that LIAH has a low overhead over Hadoop and HAIL for the 
very first job only: all the following jobs are faster in LIAH. 

2. RELATED WORK 

Offline Indexing. Indexing is a crucial step in all major 
DBMSs [15, 8, 1, 6, 9]. The overall idea behind all these ap- 
proaches is to analyze a query workload and decide which attributes 
to index based on these observations. Several research works have 
focused on supporting index access in MapReduce workflows [28, 
25, 12, 23]. However, all these offline approaches have three big 
disadvantages. First, they incur a high upfront indexing cost that 
several applications cannot afford (such as scientific applications). 
Second, they only create a single clustered index per dataset, which 
is not suitable for query workloads having selection predicates on 
different attributes. Third, they cannot adapt to changes in query 
workloads without the intervention of a DBA. Recently, we pro- 
posed HAIL [13] to solve the first two problems, but HAIL is an 
enhancement of the upload pipeline in HDFS. Therefore, HAIL still 
cannot adapt to changes in the query workload. LIAH completes 
the puzzle: it enhances the Hadoop MapReduce framework (and 
not HDFS) in order to allow Hadoop to adapt to query workloads. 

Online Indexing. Tuning a database at upload time has become 
harder as query workloads become more dynamic and complex. 
Thus, different DBMSs started to use online tuning tools to attack 
the problem of dynamic workloads [27, 4, 5, 26]. The idea is to 
continuously monitor the performance of the system and create (or 
drop) indexes as soon as it is considered beneficial. Manimal [7, 
22] can be used as an online indexing approach for automatically 
optimizing MapReduce jobs. The idea of Manimal is to generate a 
MapReduce job for index creation as soon as an incoming MapRe- 
duce job has a selection predicate on an unindexed attribute. Online 
indexing can then adapt to query workloads. However, online in- 
dexing techniques require indexing a dataset completely in one pass. 
Therefore, online indexing techniques simply transfer the high cost 
of index creation from upload time to query processing time. 

Adaptive Indexing. LIAH is inspired by database cracking [19], 
which aims at removing the high upfront cost barrier of index cre- 
ation. The main idea of database cracking is to start organising 
a given attribute (i.e. to create an adaptive index on an attribute) 
when it receives for the first time a query with a selection predicate 
on that attribute. Thus, future incoming queries having predicates 
on the same attribute continue refining the adaptive index as long 
as finer granularity of key ranges is advantageous. Key ranges in 
an adaptive index are disjoint, where keys in each key range are 
unsorted. Basically, adaptive indexing performs for each query one 
step of quicksort using the selection predicates as pivot for par- 
titioning attributes. LIAH differs from adaptive indexing in four 
aspects. First, LIAH creates a clustered index for each data block 
and hence avoids any data shuffling across data blocks. This allows 
LIAH to preserve Hadoop fault-tolerance. Second, LIAH consid- 
ers disk-based systems and thus it factors in the cost of reorganising 
data inside data blocks. Third, LIAH parallelises the indexing effort 
across several computing nodes to minimise the indexing overhead. 
Fourth, LIAH focuses on creating clustered indexes instead of un- 
clustered indexes. A follow-up work [20] focuses on lazily aligning 
attributes to converge into a clustered index after a certain number 
of queries. However, it considers a main memory system and hence 
does not factor in the I/O-cost for moving data many times on disk. 
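
The quicksort-like step that database cracking performs for each query can be illustrated with a small in-memory sketch. This is only an illustration of the partitioning idea described above, not the implementation from [19]; the function name `crack_in_two` and the list-based column are assumptions made for the example.

```python
def crack_in_two(column, lo, hi, pivot):
    """Partition column[lo:hi] in place so that values < pivot precede
    values >= pivot; return the position of the first value >= pivot."""
    i, j = lo, hi - 1
    while i <= j:
        if column[i] < pivot:
            i += 1
        else:
            # Move the large value to the right end of the open range.
            column[i], column[j] = column[j], column[i]
            j -= 1
    return i


# A query with the predicate "value < 10" uses 10 as pivot. Afterwards the
# column consists of two disjoint key ranges whose contents are unsorted.
column = [13, 4, 55, 9, 12, 7, 1, 19]
split = crack_in_two(column, 0, len(column), 10)
```

Each further query with a predicate on the same attribute would crack one of the existing ranges again, which is why many queries are needed before the column is fully sorted.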

Adaptive Merging. Another work related to LIAH is adaptive 
merging [16]. This approach uses standard B-trees to persist inter- 
mediate results during an external sort. Then, it only merges those 
key ranges that are relevant to queries. In other words, adaptive 
merging incrementally performs external sort steps as a side effect 
of query processing. However, this approach cannot be applied di- 
rectly to MapReduce workflows for three reasons. First, like adap- 
tive indexing, this approach creates unclustered indexes. Second, 
merging data in MapReduce destroys Hadoop fault-tolerance and 
hurts the performance of MapReduce jobs. This is because adaptive 
merging would require us to merge data from several data blocks 
into one. Notice that merging data inside a data block would not 
make sense as a data block is typically loaded entirely into main 
memory by map tasks anyways. Third, it has an expensive initial 
step to create the first sorted runs. Recently, a follow-up work uses 
adaptive indexing to reduce the cost of the initial step of adaptive 
merging in main memory [21]. However, it considers main mem- 
ory systems and still has the first two problems. 

To our knowledge, this work is the first research effort to propose 
an adaptive indexing solution suitable for MapReduce systems. 

3. HAIL RECAP 

Recall that the main goal of LIAH is to keep the impact of index 
creation on job runtime minimal. This is similar to the idea of 
HAIL, which shows that indexing during HDFS upload is possible 
with basically no overhead. LIAH inherits this feature from HAIL 
and exploits this feature for creating indexes at query processing 
time. Hence, in the following, we briefly explain the data upload 
and MapReduce job execution pipeline in HAIL. For details about 
HAIL or the Hadoop execution plan see [13] and [12], respectively. 

3.1 Data Upload in HAIL 

Like in Hadoop, in HAIL, the first step for a user is to upload her 
dataset to HDFS. During the upload process, HAIL splits datasets 
(i.e. files) into data blocks (usually 64 MB to 256 MB in size). 
Then, for each data block, HAIL stores several replicas (three by 
default) on different nodes for fault tolerance and load balancing 
reasons. HAIL differs from Hadoop in two major aspects: 

(1.) HAIL supports logical data block replication. In other words, 
HAIL can store the physical data block replicas for a given logi- 
cal data block in different physical layouts as long as they contain 
logically the same data. This follows the same principle as Trojan 
Layouts [24]. Notice that this is in contrast to Hadoop, which con- 
siders physical data block replication (i.e. all physical data block 
replicas of the same logical data block are byte-identical). 

(2.) HAIL can create as many clustered indexes as data block repli- 
cas. This is possible as HAIL uses logical data block replication 
and thus HAIL can exploit different sort orders for each physical 
data block replica. As a result, users can configure the clustered in- 
dexes to create for their datasets. When uploading a dataset, HAIL 
transforms the dataset from a textual row layout into a binary PAX [2] rep- 
resentation and creates the clustered indexes as specified by users. 
Notice that HAIL piggybacks index creation on the natural HDFS 
process of copying the data from disk to main memory. Since this 
process is I/O-bound, HAIL can exploit unused CPU cycles to gen- 
erate the requested indexes with basically unnoticeable overhead. 
HAIL keeps detailed information about created indexes as index 
header of data block replicas. In particular, a HAIL data block con- 
tains: (i) a block header (containing the data length and attribute 
offsets), (ii) an index header, (iii) the index data, and (iv) the data 
content. Additionally, HAIL establishes a mapping {block_id → 
list<block_replica_info>} on the HDFS NameNode for query pro- 
cessing purposes. Notice that the block_replica_info contains the 
node storing the replica and the available indexes for that replica. 
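
The NameNode mapping described above can be pictured as a dictionary from block identifiers to per-replica records. The sketch below is a simplified stand-in and not HAIL's actual data structure: the class `BlockReplicaInfo`, the node names, and the helper `replicas_with_index` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BlockReplicaInfo:
    node: str                                   # node storing this replica
    indexed_attributes: list = field(default_factory=list)


# Hypothetical stand-in for the mapping block_id -> list of replica infos.
replica_directory = {
    42: [
        BlockReplicaInfo("datanode-5", ["a"]),
        BlockReplicaInfo("datanode-7", ["b"]),
        BlockReplicaInfo("datanode-9", ["c"]),
    ],
}


def replicas_with_index(block_id, attribute):
    """Return the replicas of block_id that carry an index on attribute."""
    return [r for r in replica_directory.get(block_id, [])
            if attribute in r.indexed_attributes]
```

With three replicas sorted on a, b, and c respectively, a lookup for attribute b returns only the replica on the second node, while a lookup for an unindexed attribute returns an empty list.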

3.2 Job Execution in HAIL 

Executing MapReduce jobs in HAIL differs from Hadoop in 
three aspects: 

(1.) Users can annotate their map functions with selections and 
projections in order to benefit from clustered indexes and the PAX 
representation of data blocks. 

(2.) In the splitting phase (HailSplitting), HAIL queries the HDFS 
NameNode to find out if there exists a matching index with respect 
to annotated selections of incoming MapReduce jobs. If a suitable 
index exists, the HailSplitting policy forms input splits² (HAIL In- 
putSplits) with several data block replicas that are stored on the 
same node. This allows HAIL to schedule a single map task to 
one node and avoid the high overhead for initialising and finalis- 
ing map tasks. To not impact failover, one can limit the number of 
data blocks in a single HAIL InputSplit based on selectivities. The 
idea is that processing an input split should not take longer than 
performing a full scan over a single data block. If no suitable index 
exists, HAIL then falls back to Hadoop splitting by scheduling one 
map task per data block. 

(3.) A map task processes its HAIL InputSplit by reading only the 
projected attributes. If a suitable index exists, the map task per- 
forms an index scan over its input split. Thus, HAIL usually dra- 
matically reduces the I/O-cost for selective jobs. If no suitable in- 
dex exists, HAIL falls back to full scan, but it still benefits from the 
PAX layout by reading only the required attributes. 
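
The splitting policy of step (2.) can be sketched as follows. This is a simplified illustration under stated assumptions, not the HailSplitting code: index availability is tracked per block, `form_input_splits` and its arguments are hypothetical names, and the cap on blocks per split is passed in directly rather than derived from selectivities.

```python
from collections import defaultdict


def form_input_splits(block_locations, max_blocks_per_split):
    """Group indexed blocks by node; fall back to one split per block.

    block_locations maps block_id -> node holding a replica with a
    matching index, or None if no such replica exists.
    """
    by_node = defaultdict(list)
    splits = []
    for block_id, node in sorted(block_locations.items()):
        if node is None:
            # No suitable index: one map task per data block, as in Hadoop.
            splits.append((None, [block_id]))
        else:
            by_node[node].append(block_id)
    for node, blocks in sorted(by_node.items()):
        # Cap the split size so that redoing a failed task stays cheap.
        for i in range(0, len(blocks), max_blocks_per_split):
            splits.append((node, blocks[i:i + max_blocks_per_split]))
    return splits


splits = form_input_splits(
    {1: "n1", 2: "n1", 3: "n2", 4: None, 5: "n1"}, max_blocks_per_split=2)
```

Here the three indexed blocks on node n1 end up in two splits because of the cap, while the unindexed block 4 receives its own split.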

It is worth noting that HAIL cannot benefit from its indexes if the 
selection predicate is on any unindexed attribute. This is because 
there is no way for HAIL to adapt to query workloads. LIAH over- 
comes this problem: it creates additional indexes at job runtime 
based on the selection predicates of incoming jobs. 
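
To make the benefit of a matching index concrete, the index scan of step (3.) can be sketched on a block held in column-wise (PAX-like) form. The sketch assumes a sparse index that stores the first key of every fixed-size partition of the sorted key column; the function names and the dictionary-of-lists block layout are assumptions for illustration, not HAIL's on-disk format.

```python
from bisect import bisect_left, bisect_right


def build_sparse_index(sorted_keys, partition_size):
    """Keep the first key of every partition of the sorted key column."""
    return sorted_keys[::partition_size]


def index_scan(columns, key_attr, sparse_index, partition_size,
               low, high, projection):
    """Return the projected values of all rows with low <= key <= high."""
    keys = columns[key_attr]
    # The partition before the first one starting at >= low may still
    # contain qualifying keys, so the scan starts one partition earlier.
    first = max(bisect_left(sparse_index, low) - 1, 0)
    # Partitions whose first key exceeds high cannot contain matches.
    last = bisect_right(sparse_index, high)
    start = first * partition_size
    end = min(last * partition_size, len(keys))
    rows = [i for i in range(start, end) if low <= keys[i] <= high]
    # Only the projected columns are touched for tuple reconstruction.
    return [tuple(columns[a][i] for a in projection) for i in rows]


block = {"d": [1, 3, 5, 7, 9, 11, 13, 15], "x": list("abcdefgh")}
sparse = build_sparse_index(block["d"], partition_size=2)
result = index_scan(block, "d", sparse, 2, low=5, high=10, projection=["x"])
```

If the predicate is on an attribute the block is not sorted on, no such index can be used and every row has to be inspected, which is the full-scan fallback described above.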

4. LIAH 

In this section, we discuss the fundamentals of LIAH: an ap- 
proach to efficiently support Lazy Indexing and Adaptivity in 
Hadoop. The core idea of LIAH is to create missing but promising 
indexes as byproducts of full scans in the map phase of MapReduce 
jobs. Our concept is to piggyback on a procedure that is naturally 
reading the relevant data from disk to main memory anyways. This 
allows LIAH to completely save the data read cost for adaptive in- 
dex creation. Another beauty of LIAH is that it builds clustered 
indexes in parallel to the execution of map tasks. As map tasks are 
usually I/O dominated, LIAH can then exploit under-utilised CPU 
cycles for index creation. 

Figure 1: LIAH pipeline. 

In the following, we first give a general overview of the LIAH 
pipeline in Section 4.1. Then, in Section 4.2, we focus on the in- 
ternal components for building and storing clustered indexes. In 
Section 4.3, we present how LIAH accesses the indexes created at 
job runtime in a way that is transparent to the MapReduce job ex- 
ecution pipeline. Next, in Section 4.4, we discuss how to make the 
indexing overhead over MapReduce jobs almost invisible to users. 
Finally, in Section 4.5, we present LIAH Scheduling, which allows 
us to balance index creation effort across all available nodes. 

4.1 Job Execution Pipeline 

Let us explain the job execution pipeline in LIAH with an exam- 
ple. Assume Alice wants to analyse the experimental results of a 
set of experiments she ran. Recall that she collects the experimental 
results in a large dataset containing many numeric attributes. As- 
sume the fortunate case that Alice studies her dataset a little bit. She 
identifies attributes a, b, and c as the most interesting search crite- 
ria for her MapReduce jobs. As HDFS uses a default replication 
factor of three, Alice decides to configure LIAH to create a clus- 
tered index on each of these three attributes at upload time. This 
is possible because LIAH inherits the index creation at upload time 
feature from HAIL. Thus, as long as Alice sends jobs with selec- 
tion predicates on a, b, or c, LIAH can benefit from clustered index 
scans. In these cases, LIAH behaves exactly as HAIL, i.e. map 
tasks perform an index scan in order to fetch only the qualifying 
records from disk. However, as soon as Alice (or one of her col- 
leagues) sends a new job (say job_d) with a selection predicate on a 
different attribute (e.g. on attribute d), LIAH cannot benefit from 
index scans anymore. In contrast to HAIL, LIAH takes these missed 
chances of index scans as hints on how to improve the repertoire 
of indexes for future jobs. LIAH piggybacks the creation of a clus- 
tered index over attribute d on the execution of job_d. Without any 
loss of generality, we assume that job_d projects all attributes from 
its input dataset. We will drop this assumption in Section 5.3. 

Figure 1 illustrates the general workflow of how LIAH processes 
map tasks of job_d when no suitable index is available. As soon as 
LIAH schedules a map task to a specific TaskTracker³, e.g. Task- 
Tracker 5, the LIAH RecordReader of the map task first reads the 
metadata (including HDFS paths, offsets, and index availability) 
from the LIAH InputSplit ①. With this metadata, the LIAH Recor- 
dReader checks whether a suitable index is available for its input 
data block (say block_42). As no index on attribute d is available, 
the LIAH RecordReader opens an input stream to the local replica 
of block_42 stored on DataNode 5. Then, the LIAH RecordReader 
reads the metadata from the data block header to obtain the offsets 
of the attributes required by job_d. Next, the LIAH RecordReader: 
(i) loads all the values of the required attributes from disk to main 
memory ②; (ii) reconstructs the records (as data blocks are in PAX 
representation); (iii) feeds the map function with each record ③. 
Here lies the beauty of LIAH: a data block that is a potential can- 
didate for indexing was completely transferred to main memory as 
a natural part of the job execution process. In addition to feeding 
the entire block_42 to the map function, LIAH can create a clustered 
index on attribute d to speed up future jobs. For this, the LIAH 
RecordReader passes block_42 to the Adaptive Indexer as soon as 
the map function finishes processing the data block ④.⁴ The Adap- 
tive Indexer, in turn, sorts the data in block_42 according to attribute 
d, aligns other attributes through reordering, and creates a sparse 
clustered index as described in [13] ⑤. Finally, the Adaptive In- 
dexer stores this index with a copy of block_42 (sorted on attribute 
d) as a pseudo data block replica ⑥. Additionally, the Adaptive 
Indexer registers the newly created index for block_42 with the HDFS 
NameNode ⑦. In fact, the implementation of the LIAH pipeline in- 
volves some interesting technical challenges. We discuss the LIAH 
pipeline in more detail in the remainder of this section. 

³ A Hadoop instance responsible for executing map and reduce tasks. 

4.2 Adaptive Indexer 

Since LIAH is an automatic process that is not explicitly re- 
quested by users, LIAH should not impose unexpectedly signifi- 
cant performance penalties on users' jobs. Piggybacking adaptive 
indexing on map tasks allows us to completely save the read I/O- 
cost. However, the indexing effort is shifted to query time. As a 
result, any additional time involved in indexing will potentially add 
to the total runtime of MapReduce jobs. Therefore, the first concern 
of LIAH is: how to make adaptive index creation efficient? 

To overcome this issue, the idea of LIAH is to run the mapping 
and indexing processes in parallel. However, interleaving map task 
execution with indexing bears the risk of race conditions between 
map tasks and the Adaptive Indexer on the data block. In other 
words, the Adaptive Indexer might potentially reorder data inside 
a data block, while the map task is still concurrently reading the 
data block. One might think about copying data blocks before in- 
dexing to deal with this issue. Nevertheless, this would entail the 
additional runtime and memory overhead of copying such memory 
chunks. For this reason, LIAH does not interleave the mapping and 
indexing processes on the same data block. Instead, LIAH inter- 
leaves the indexing of a given data block (e.g. block_42) with the 
mapping phase of the succeeding data block (e.g. block_43). For 
this, LIAH uses a producer-consumer pattern: a map task acts as 
producer by offering a data block to the Adaptive Indexer, via a 
bounded blocking queue, as soon as it finishes processing the data 
block; in turn, the Adaptive Indexer is constantly consuming data 
blocks from this queue. As a result, LIAH can perfectly interleave 
map tasks with indexing, except for the first and last data block to 
process in each node. It is worth noting that the queue exposed by 
the Adaptive Indexer is allowed to reject data blocks in case a cer- 
tain limit of enqueued data blocks is exceeded. This prevents the 
Adaptive Indexer from running out of memory because of overload. Still, 
future MapReduce jobs with a selection predicate on the same at- 
tribute (i.e. on attribute d) can in turn take care of indexing 
the rejected data blocks. Once the Adaptive Indexer pulls a data 
block from its queue, it processes the data block using two internal 
components: the Index Builder and the Index Writer. Figure 2 il- 
lustrates the pipeline of these two internal components, which we 
discuss in the following. 

Figure 2: Adaptive Indexer internals. 
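
The producer-consumer pattern between map tasks and the Adaptive Indexer can be sketched with a bounded queue whose offer operation fails instead of blocking when the queue is full. This is a minimal sketch in Python rather than the Hadoop-side implementation; the queue capacity, the function names, and the use of block numbers as placeholders for in-memory data blocks are all assumptions.

```python
import queue
import threading

index_queue = queue.Queue(maxsize=2)   # bounded; the capacity is an assumed value
indexed, rejected = [], []


def adaptive_indexer():
    """Consumer: pull processed blocks from the queue and index them."""
    while True:
        block_id = index_queue.get()
        if block_id is None:           # sentinel: no more blocks will arrive
            break
        indexed.append(block_id)       # stand-in for sort + index + write


def map_task(block_ids):
    """Producer: process blocks one by one, then offer each for indexing."""
    for block_id in block_ids:
        # ... the records of block_id are fed to the map function here ...
        try:
            index_queue.put_nowait(block_id)   # non-blocking offer
        except queue.Full:
            rejected.append(block_id)          # left for a future job


consumer = threading.Thread(target=adaptive_indexer, daemon=True)
consumer.start()
map_task([42, 43, 44, 45])
index_queue.put(None)
consumer.join()
```

Because a block is offered only after the map function has finished with it, the indexer never reorders data that the map task is still reading, and a rejected block costs the current job nothing beyond the missed indexing opportunity. Which blocks are rejected in a given run depends on thread timing.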

4.2.1 Index Builder 

The Index Builder is a daemon thread that is responsible for cre- 
ating sparse clustered indexes on data blocks in the data queue. 
With this aim, the Index Builder is constantly pulling one data 
block after another from the data block queue ①. Then, for each 
data block, the Index Builder starts with sorting the attribute col- 
umn to index (attribute d in our example) ②. Additionally, the 
Index Builder builds a mapping {old_position → new_position} 
for all values as a permutation vector. After that, the Index Builder 
uses the permutation vector to reorder all other attributes in the of- 
fered data block ③. Once the Index Builder finishes sorting the 
entire data block on attribute d, it builds a sparse clustered index on 
attribute d ④. Then, the Index Builder passes the newly indexed 
data block to the Index Writer ⑤. The Index Builder also com- 
municates with the Index Writer via a blocking queue. This allows 
LIAH to also parallelise indexing with the I/O process for storing 
newly indexed data blocks. 
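The Index Builder steps can be illustrated with a small sketch over a column-oriented block. It assumes a block is a dictionary from attribute name to a list of values and that the sparse index stores the first key of every fixed-size partition; the function name, the layout, and `partition_size` are illustrative assumptions, not details taken from LIAH.

```python
def build_clustered_index(block, index_attr, partition_size=2):
    """Sort a column-oriented block on index_attr and build a sparse index.

    block: dict mapping attribute name -> list of values (assumed layout).
    Returns (sorted_block, permutation, sparse_index).
    """
    column = block[index_attr]
    # Sort the index column, remembering the old position of each value.
    order = sorted(range(len(column)), key=column.__getitem__)
    # Permutation vector: permutation[old_position] = new_position.
    permutation = [0] * len(column)
    for new_pos, old_pos in enumerate(order):
        permutation[old_pos] = new_pos
    # Reorder every attribute with the same permutation vector.
    sorted_block = {}
    for name, values in block.items():
        reordered = [None] * len(values)
        for old_pos, new_pos in enumerate(permutation):
            reordered[new_pos] = values[old_pos]
        sorted_block[name] = reordered
    # Sparse index: (first key, start offset) for every partition.
    keys = sorted_block[index_attr]
    sparse_index = [(keys[i], i) for i in range(0, len(keys), partition_size)]
    return sorted_block, permutation, sparse_index
```

Keeping the permutation vector separate from the sorted column means the remaining attributes can be realigned one at a time without sorting again.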

4.2.2 Index Writer 

The Index Writer is a daemon thread that is responsible for persisting
indexes created by the Index Builder to disk. The Index
Writer continuously pulls newly indexed data blocks from its queue
in order to persist them on HDFS (6). Once the Index Writer pulls a
newly indexed data block (say block 42), it creates the block metadata
and index metadata for block 42 (7). Notice that a newly indexed
data block is just another replica of the logical data block, but with
a different sort order. For instance, in our example of Section 4.1,
creating an index on attribute d for block 42 leads to having four
data block replicas for block 42: one replica for each of the first four
attributes. Therefore, the Index Writer could simply write a new indexed
data block as another replica. However, HDFS supports data
block replication only at the file level, i.e. HDFS replicates all the
data blocks of a given dataset the same number of times. This goes
against the incremental nature of LIAH.

To solve this problem, the Index Writer creates a pseudo data
block replica, which is a new HDFS file (8). Therefore, although
the pseudo data block replica is a logical copy of block 42, the NameNode
does not recognise it as a normal data block replica. Instead,
the NameNode simply sees the pseudo data block replica as
another index available for block 42. To avoid shipping data from
one node to another, the Index Writer aims at storing the pseudo
data block replica locally on DataNode 5. To this end, the Index
Writer stores the pseudo data block replica with replication
factor one. The Index Writer follows a naming convention, which
contains the identifier of the block and the main index attribute, to
uniquely identify a pseudo data block replica. It is worth noting
that a map task can compete with a speculative map task (speculative
execution) on a different node to create a pseudo data block
replica. To deal with this race condition, we follow the same pattern
used by map tasks to store their intermediate output in original
Hadoop. This means that each Index Writer first stores the pseudo
data block replica in a temporary file and then tries to rename it
after completion. Only the first Index Writer will succeed and the
second one will remove its temporary pseudo data block replica.
Additionally, the successful Index Writer informs the NameNode
about the new index on attribute d (9). This allows LIAH to take
the newly created indexes into account for processing future jobs.

Figure 3: LIAH RecordReader internals.
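The write-to-temporary-then-rename pattern used against competing speculative tasks can be sketched on a local filesystem. This is only an analogy under stated assumptions: `os.link` is used here to emulate a rename that fails when the destination already exists, and the function name, the flat file naming, and `writer_id` are hypothetical rather than part of LIAH or HDFS.

```python
import os
import tempfile


def publish_pseudo_replica(directory, block_id, attribute, payload, writer_id):
    """Write payload to a temporary file, then try to claim the final name.

    Returns True if this writer published the replica, False if another
    writer had already done so. In both cases the temporary file is removed.
    """
    # Assumed naming convention: block identifier plus index attribute.
    final_path = os.path.join(directory, f"blk_{block_id}_{attribute}")
    tmp_path = f"{final_path}.tmp.{writer_id}"
    with open(tmp_path, "wb") as tmp:
        tmp.write(payload)
    try:
        # os.link raises FileExistsError if final_path exists, which
        # emulates a no-overwrite rename: only the first writer wins.
        os.link(tmp_path, final_path)
        won = True
    except FileExistsError:
        won = False
    os.remove(tmp_path)
    return won


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    print(publish_pseudo_replica(d, 42, "d", b"sorted block", "task_a"))
    print(publish_pseudo_replica(d, 42, "d", b"sorted block", "task_b"))
```

Only the winning writer would then go on to report the new index; the loser has nothing left to clean up beyond its own temporary file.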

4.3 Pseudo Data Block Replicas 

Recall from Section 4.2.2 that a pseudo data block replica is a 
logical copy of a data block (in a different sort order) that is stored 
by LIAH as a new HDFS file rather than as a normal data block 
replica. This allows LIAH to keep a different replication factor 
on a block basis rather than on a file basis. As pseudo data block 
replicas are stored in different HDFS files than normal data block 
replicas, an important question arises: how to access pseudo data 
block replicas in an invisible way for users? 

LIAH achieves this transparency via its RecordReader (the 
LIAH RecordReader). Users continue annotating their map func- 
tions (with selection predicates and projections) as with HAIL. The 
LIAH RecordReader takes care of automatically switching from 
normal to pseudo data block replicas. For this, the LIAH Recor- 
dReader uses the LIAH InputStream, a wrapper of the Hadoop 
FSInputStream. We discuss the details of both the LIAH Recor- 
dReader and the LIAH InputStream below. 

Figure 3 illustrates the internal pipeline of the LIAH RecordReader
when processing a given LIAH InputSplit. When a map
task starts, the LIAH RecordReader first reads the metadata of its
LIAH InputSplit in order to check if there exists a suitable index
to process the input data block (block 42) (1). If a suitable index
is available, the LIAH RecordReader initialises the LIAH InputStream
with the selection predicate of job_d as a parameter (2). Internally,
the LIAH InputStream checks if the index resides in a normal
or pseudo data block replica (3). This allows the LIAH InputStream
to open an input stream to the right HDFS file. This is because normal
and pseudo data block replicas are stored on different HDFS
files. While all normal data block replicas belong to the same
HDFS file, each pseudo data block replica belongs to a different
HDFS file (4). As in our example the index on attribute d for block 42
resides in a pseudo data block replica, the LIAH InputStream opens
an input stream to the HDFS file /pseudo/blk_42/d (5). As a
result, the LIAH RecordReader does not care from which file it is
reading since normal and pseudo data block replicas have the same
format. Therefore, switching between a normal and a pseudo data
block replica is not only invisible to users, but also to the LIAH
RecordReader. The LIAH RecordReader just reads the block and
index metadata using the LIAH InputStream (6). After performing
an index lookup for the selection predicate of job_d, the LIAH
RecordReader loads only the qualifying tuples (e.g. tuples 1024 -
2048) from the projected attributes (a, b, c, and d) (7). Finally, the
LIAH RecordReader forms key-value pairs and passes only qualifying
pairs to the map function (8).

If no suitable index exists, the LIAH RecordReader
takes the Hadoop InputStream, which opens an input stream to any
normal data block replica, and falls back to a full scan.
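The decision of which file to open can be summarised in a few lines. The sketch below assumes the index metadata is available as a dictionary keyed by (block, attribute); the function name, the `index_catalog` structure, and the dataset path used in the example are hypothetical, while the pseudo replica path follows the /pseudo/blk_42/d example above.

```python
def choose_input_path(block_id, predicate_attr, index_catalog, dataset_path):
    """Pick the file to read for one block and the resulting access method.

    index_catalog: dict mapping (block_id, attribute) to "normal" or
    "pseudo" (an assumed representation of the index metadata).
    """
    location = index_catalog.get((block_id, predicate_attr))
    if location == "pseudo":
        # Each pseudo data block replica lives in its own file.
        return f"/pseudo/blk_{block_id}/{predicate_attr}", "index scan"
    if location == "normal":
        # Normal replicas all belong to the dataset's own file.
        return dataset_path, "index scan"
    # No suitable index: read any normal replica and scan it fully.
    return dataset_path, "full scan"
```

Since both kinds of replica share one format, the caller can treat the returned path uniformly and only the access method differs.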

4.4 Lazy Indexing 

The blocking queues used by the Adaptive Indexer allow us to 
easily protect LIAH against CPU overloading. However, writing 
pseudo data block replicas can also slow down the parallel read and 
write processes of MapReduce jobs. In fact, the negative impact of 
extra I/O operations can be high as MapReduce jobs are typically 
I/O-bound. As a result, LIAH as a whole might become slower 
even if the Adaptive Indexer can computationally keep up with the 
job execution. So, the question that arises is: how to write pseudo 
data block replicas efficiently? 

LIAH solves this problem by making indexing incremental, 
i.e. LIAH spreads index creation over multiple MapReduce jobs. 
The goal is to balance index creation cost over multiple MapRe- 
duce jobs so that users perceive small (or no) overhead in their 
jobs. To do so, LIAH uses an offer rate, which is a ratio that limits
the maximum number of pseudo data block replicas (i.e. the number
of data blocks to index) to create during a single MapReduce
job. For example, using an offer rate of 10%, LIAH indexes in a
single MapReduce job at most one data block out of ten processed
data blocks (i.e. LIAH only indexes 10% of the total data
blocks). Notice that consecutive jobs with selections on the same
attribute benefit from pseudo data block replicas created during previous
jobs. This strategy brings two major advantages. First, LIAH
can reduce the additional I/O introduced by indexing to any level
that is acceptable for the user. Second, the indexing effort done by
LIAH follows the current query workload. Another advantage
of using an offer rate is that users can decide how fast they
want to converge to a complete index, i.e. all data blocks are indexed.
For instance, using an offer rate of 10%, LIAH would require
10 MapReduce jobs with a selection predicate on the same
attribute to converge to a complete index. Therefore, on the one
hand, the investment in terms of
time and space for MapReduce jobs with selection predicates on
infrequent attributes is minimised. On the other hand,
MapReduce jobs with selection predicates on frequent attributes
quickly converge to a completely indexed copy. We discuss more
details about different offer rate strategies in Section 5.
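The effect of a constant offer rate on convergence can be checked with a short simulation. It is a simplification under stated assumptions: every job has a selection on the same attribute, no offered block is rejected by the queue, and the per-job quota is the offer rate times the total number of blocks. The function name is illustrative.

```python
def simulate_lazy_indexing(n_blocks, offer_rate):
    """Return the number of indexed blocks after each consecutive job."""
    # Per-job quota of blocks to index (at least one block per job).
    quota = max(1, round(n_blocks * offer_rate))
    indexed = 0
    history = []
    while indexed < n_blocks:
        indexed = min(n_blocks, indexed + quota)
        history.append(indexed)
    return history


if __name__ == "__main__":
    for rate in (0.1, 0.25, 0.5, 1.0):
        print(rate, len(simulate_lazy_indexing(100, rate)))
```

Under these assumptions an offer rate of 10% needs ten jobs to reach a complete index, matching the example in the text.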

4.5 LIAH Scheduling 

An interesting result we found in [13] is that the initialisation 
and finalisation costs of map tasks are so high that they basically 
dominate short running jobs. Thus, reducing the number of map 
tasks is crucial to improve the performance of MapReduce jobs. 

To deal with this problem, we introduce LIAH Scheduling, 
an extension of the HAIL Scheduling proposed in [13]. LIAH 
Scheduling works as follows. First, LIAH Scheduling partitions 
all input data blocks into indexed data blocks and unindexed data 
blocks. Second, LIAH Scheduling combines several indexed data 
blocks into one split as described in [13]. This allows LIAH to 
reduce the number of map tasks to schedule, thereby reducing the 
total overhead for initialising and finalising map tasks. Notice that, 
after this step, LIAH can obtain the exact number of already exist- 
ing indexes for each computing node. Third, like original Hadoop, 
LIAH creates one map task per unindexed data block. For each 
map task, LIAH considers r different computing nodes as possible 
locations to schedule a map task, where r is the replication factor of 
the input dataset. However, in contrast to original Hadoop, LIAH 
tries to schedule a map task to the computing node with the smallest 
number of existing indexes. As a result, LIAH can: (i) better paral- 
lelise index access for future MapReduce jobs and (ii) increase the 
chances to keep both normal and pseudo data block replicas in the 
same node. The last point prevents LIAH from shuffling data through
the network when writing pseudo data block replicas.
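The third scheduling step, placing each map task on the candidate node with the fewest existing indexes, can be sketched as below. The data structures and the function name are assumptions for illustration; in particular, incrementing a node's count after each assignment and breaking ties by node name are choices made for this example, not behaviour stated in the text.

```python
def schedule_unindexed_blocks(replica_locations, index_counts):
    """Assign one map task per unindexed block to a node.

    replica_locations: dict block_id -> list of the r nodes holding a replica.
    index_counts: dict node -> number of existing indexes on that node.
    Returns dict block_id -> chosen node.
    """
    counts = dict(index_counts)  # work on a copy
    assignment = {}
    for block_id in sorted(replica_locations):
        candidates = replica_locations[block_id]
        # Prefer the node with the fewest indexes; ties broken by node name.
        node = min(candidates, key=lambda n: (counts.get(n, 0), n))
        assignment[block_id] = node
        # Assumption: the task may create an index there, so count it.
        counts[node] = counts.get(node, 0) + 1
    return assignment
```

Spreading new indexes this way keeps later index scans parallel across nodes, which is the first benefit named above.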

5. ADAPTIVE INDEXING STRATEGIES 

We now present three strategies that allow LIAH to improve the 
performance of MapReduce jobs. We first present eager index- 
ing, a technique that allows LIAH to adapt its incremental index- 
ing mechanism to the number of already created pseudo data block 
replicas. We then discuss how LIAH can prioritise data blocks for 
indexing based on their selectivity. Finally, we introduce invisi- 
ble projection, a new technique to deal with partial projections of 
users' MapReduce jobs. 

5.1 Eager Adaptive Indexing 

Recall that LIAH uses an offer rate to throttle down adaptive 
indexing efforts to an acceptable (or even invisible) degree for users 
(see Section 4.4). However, let us make two important observations 
that could make a constant offer rate not desirable for certain users: 

(1.) Using a constant offer rate, the job runtime of consecutive
MapReduce jobs having a filter condition on the same attribute is
not constant. Instead, they have an almost linearly decreasing runtime
up to the point where all blocks are indexed. This is because
the first MapReduce job is the only one to perform a full scan over all
the data blocks of a given dataset. Consecutive jobs, even when
indexing and storing the same number of blocks, are likely to run
faster as they benefit from all indexing work of their predecessors.

Table 1: Cost model parameters.

Notation             Description
$n_{slots}$          The number of map tasks that can run in parallel in a given Hadoop cluster
$n_{blocks}$         The number of data blocks of a given dataset
$n_{idxBlocks}$      The number of blocks with a relevant index
$n_{fsw}$            The number of map waves performing a full scan
$t_{fsw}$            The average runtime of a map wave performing a full scan (without adaptive indexing overhead)
$t_{idxOverhead}$    The average time overhead of adaptive indexing in a map wave
$T_{idxOverhead}$    The total time overhead of adaptive indexing
$T_{is}$             The total runtime of the map waves performing an index scan
$T_{job}$            The total runtime of a given job
$T_{target}$         The targeted total job runtime
$p$                  The ratio of data blocks (w.r.t. $n_{blocks}$) offered to the Adaptive Indexer

(2.) LIAH actually delays indexing by using an offer rate. The 
tradeoff here is that using a lower offer rate leads to a lower index- 
ing overhead, but it requires more MapReduce jobs to index all the 
data blocks in a given dataset. However, some users want to limit 
the experienced indexing overhead and still desire to benefit from 
complete indexing as soon as possible. 

Therefore, we propose an eager adaptive indexing strategy to 
deal with this problem. The basic idea of eager adaptive indexing 
is to dynamically adapt the offer rate for MapReduce jobs accord- 
ing to the indexing work achieved by previous jobs. In other words, 
eager adaptive indexing tries to exploit the saved runtime and rein- 
vest it as much as possible into further indexing. To do so, LIAH 
first needs to estimate the runtime gain (in a given MapReduce job) 
from performing an index scan on the already created pseudo data 
block replicas. For this, LIAH uses a cost model to estimate the total
runtime, $T_{job}$, of a given MapReduce job (Equation 1). Table 1
lists the parameters we use in the cost model.



$$T_{job} = T_{is} + t_{fsw} \cdot n_{fsw} + T_{idxOverhead} \qquad (1)$$

We define the number of map waves performing a full scan, $n_{fsw}$,
as $\lceil (n_{blocks} - n_{idxBlocks}) / n_{slots} \rceil$. Intuitively, the total runtime $T_{job}$ of a job
consists of three parts. First, the time required by LIAH to process
the existing pseudo data block replicas, i.e. all data blocks having
a relevant index, $T_{is}$. Second, the time required by LIAH to process
the data blocks without a relevant index, $t_{fsw} \cdot n_{fsw}$. Third,
the time overhead caused by adaptive indexing, $T_{idxOverhead}$ (see footnote 5).
The adaptive indexing overhead depends on the number of data
blocks that are offered to the Adaptive Indexer and the average
time overhead observed for indexing a block. Formally, we define
$T_{idxOverhead}$ as follows:

$$T_{idxOverhead} = t_{idxOverhead} \cdot \min\left(p \cdot \left\lceil \frac{n_{blocks}}{n_{slots}} \right\rceil,\; n_{fsw}\right) \qquad (2)$$

Footnote 5: It is worth noting that $T_{idxOverhead}$ denotes only the additional
runtime that a MapReduce job has due to adaptive indexing.



We can use this model to automatically calculate the offer rate
p in order to keep the adaptive indexing overhead acceptable for
users. Formally, from Equations 1 and 2, we derive p as follows:

$$p = \frac{T_{target} - T_{is} - t_{fsw} \cdot n_{fsw}}{t_{idxOverhead} \cdot \left\lceil \frac{n_{blocks}}{n_{slots}} \right\rceil}$$

Therefore, given a target job runtime $T_{target}$, LIAH can automatically
set p in order to fully use this time budget for creating indexes
and use the gained runtime in the next jobs either to speed up
the jobs or to create even more indexes. Usually, we choose $T_{target}$
to be equal to the runtime of the very first job so that users can observe
a stable runtime till almost everything is indexed. However,
users can set $T_{target}$ to any time budget in order to adapt the indexing
effort to their needs. Notice that, since accessing pseudo data
block replicas is independent of p, LIAH first processes pseudo
data block replicas and measures $T_{is}$, before deciding what offer
rate to use for the unindexed blocks. The average runtimes $t_{fsw}$
(from Equation 1) and $t_{idxOverhead}$ (from Equation 2) can be measured
in a calibration job or given by users.
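The cost model and the derived offer rate translate directly into code. The sketch below follows Equations 1 and 2 as described in the text; the function names are illustrative, and clamping the result to the range [0, 1] is an added assumption, since a ratio outside that range has no meaning as an offer rate.

```python
import math


def full_scan_waves(n_blocks, n_idx_blocks, n_slots):
    """Number of map waves that perform a full scan."""
    return math.ceil((n_blocks - n_idx_blocks) / n_slots)


def estimate_job_runtime(t_is, t_fsw, t_idx_overhead,
                         n_blocks, n_idx_blocks, n_slots, p):
    """Total job runtime: index scans + full scan waves + indexing overhead."""
    n_fsw = full_scan_waves(n_blocks, n_idx_blocks, n_slots)
    # Overhead is paid only in waves that actually offer blocks for indexing.
    overhead = t_idx_overhead * min(p * math.ceil(n_blocks / n_slots), n_fsw)
    return t_is + t_fsw * n_fsw + overhead


def offer_rate_for_target(t_target, t_is, t_fsw, t_idx_overhead,
                          n_blocks, n_idx_blocks, n_slots):
    """Offer rate that spends the remaining time budget on indexing."""
    n_fsw = full_scan_waves(n_blocks, n_idx_blocks, n_slots)
    budget = t_target - t_is - t_fsw * n_fsw
    p = budget / (t_idx_overhead * math.ceil(n_blocks / n_slots))
    return min(1.0, max(0.0, p))  # assumption: clamp to a valid ratio


if __name__ == "__main__":
    # Hypothetical calibration values: 10 s per full scan wave, 5 s overhead.
    first = estimate_job_runtime(0, 10, 5, 100, 0, 10, p=0.1)
    print(first, offer_rate_for_target(first, 2, 10, 5, 100, 10, 10))
```

With the first job's runtime as the target, the second job can afford a higher offer rate because one full scan wave has been replaced by a cheaper index scan.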

On the one hand, LIAH can now adapt the offer rates to the per- 
formance gains obtained from performing index scans over the al- 
ready indexed data blocks. On the other hand, by gradually increas- 
ing the offer rate, eager adaptive indexing prioritises complete in- 
dex convergence over early runtime improvements for users. Thus, 
users no longer experience an incremental and linear speed up in 
job performance until the index is eventually complete, but instead 
they experience a sharp improvement when LIAH approaches
a complete index. In summary, besides limiting the overhead of
adaptive indexing, the offer rate can also be considered as a tuning
knob to trade early runtime improvements for faster indexing.

5.2 Selectivity-based Indexing 

Earlier, we saw that LIAH uses an offer rate to limit the num- 
ber of data blocks to index in a single MapReduce job. For this, 
LIAH uses a round robin policy to select the data blocks to pass to 
the Adaptive Indexer. This sounds reasonable under the assumption
that data is uniformly distributed. However, datasets are typically
skewed in practice and hence some data blocks might contain
more qualifying tuples than others under a given query workload.
Consequently, indexing highly selective data blocks before other 
data blocks promises higher performance benefits. 

Therefore, LIAH can also use a selectivity-based data block se- 
lection approach for deciding which data blocks to use. The overall 
idea is to use available computing resources in order to maximise 
the expected performance improvement for future MapReduce jobs 
running on partially indexed datasets. The big advantage of this 
approach is that users can perceive higher improvements in perfor- 
mance for their MapReduce jobs from the very first runs. Addition- 
ally, as a side-effect of using this approach, LIAH can adapt faster 
to the selection predicates of MapReduce jobs. However, how can 
LIAH efficiently obtain the selectivities of data blocks? 

For this, LIAH exploits the natural process of map tasks to propose
data blocks to the Adaptive Indexer. Recall that a map task
passes a data block to the Adaptive Indexer once the map task has
finished processing the block. Thus, LIAH can obtain the accurate
selectivity of a data block by piggybacking on the map phase, i.e. when
the data block is filtered according to the provided selection predicate.
This allows LIAH to have perfect knowledge about selectivities
for free. Given the selectivity of a data block, LIAH can
decide whether it is worth indexing the data block or not. In our current
LIAH prototype, a map task proposes a data block to the Adaptive
Indexer if the percentage of qualifying tuples in the data block is
equal to or higher than 80%. However, users can adapt this threshold
to their applications. Notice that with the statistics on data block
selectivities, LIAH can also decide which indexes to drop in case
of storage limitations. However, a discussion on an index eviction
strategy is out of the scope of this paper.
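The piggybacked selectivity check can be sketched as a single pass that both filters the block and decides whether to propose it. The function name and the list-based block representation are assumptions; the 80% default mirrors the threshold of the prototype described above.

```python
def map_and_maybe_propose(block_values, predicate, threshold=0.8):
    """Filter a block and decide whether to propose it for indexing.

    Returns (qualifying_values, selectivity, propose), where selectivity is
    the fraction of tuples in the block that satisfy the predicate.
    """
    qualifying = [v for v in block_values if predicate(v)]
    selectivity = len(qualifying) / len(block_values) if block_values else 0.0
    # Propose the block only if enough of its tuples qualify.
    return qualifying, selectivity, selectivity >= threshold
```

The selectivity comes for free because the filter has to run for the map function anyway; no extra pass over the block is needed.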

5.3 Invisible Projection 

In Section 4.1, we discussed the general flow of LIAH for creating
clustered indexes in an adaptive and incremental manner. For
simplicity, in that discussion, we implicitly assumed that consecutive
MapReduce jobs (with a selection predicate on the same attribute)
read all (or the same) attributes from disk. Indeed, this is
the simplest case for indexing data blocks, because the entire data 
blocks (or all required attributes) are available in main memory. As 
a result, the index attribute can be sorted and all other attributes can 
be reordered at the same time for alignment reasons. 

However, this becomes challenging when MapReduce jobs have 
different projections. For example, consider a dataset having four 
attributes a, b, c, d. Assume that LIAH is using an offer rate of 
50% and that there is no index on any of these attributes yet. Now, 
consider a first MapReduce job (say job_1) having a selection predicate
on d and projecting attribute b. Since LIAH does not have an
index on attribute d, LIAH creates a clustered index on attribute d
and aligns attribute b with respect to attribute d. As a result of running
job_1, LIAH ends up having a clustered index on d (and b
aligned) for 50% of the data blocks (one index for each block) in the
dataset. Now, consider a second MapReduce job (say job_2) having
a selection predicate on d and projecting attribute c. It is here that
LIAH faces a problem: LIAH cannot fully benefit from the 50% of
data blocks in the dataset that are already indexed. This is because attribute c
is still not aligned with respect to attribute d. Hence, performing an
index scan for these data blocks would lead LIAH to perform many
random I/O operations for tuple reconstruction. The same applies
for other MapReduce jobs with different projections. For instance,
a MapReduce job (say job_3), having a selection predicate on attribute
d and projecting attribute a, cannot benefit from the indexes
created by either job_1 or job_2. Therefore, the question is:
how to create clustered indexes when MapReduce jobs have filter
conditions on the same attribute but have different projections?

To deal with this problem, we introduce the invisible projection 
technique. The idea is to additionally read all missing attributes 
before passing a data block to the Adaptive Indexer. Notice that
LIAH applies the invisible projection technique only for the data
blocks that map tasks propose to the Adaptive Indexer. The beauty
of this technique is that it is transparent for users as map tasks still
process the attributes required by MapReduce jobs. For example,
consider again job_1 from the above example. In this case, LIAH
(the HAIL RecordReader) provides only attributes d and b to the 
map function of map tasks, but LIAH provides all attributes (a, b, 
c, and d) to the Adaptive Indexer. This way LIAH ensures that 
data blocks are always completely available in main memory for 
the Adaptive Indexer. 
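The split between what the map function sees and what the Adaptive Indexer receives can be sketched as two views of the same block. The function name and the dictionary-based block layout are illustrative assumptions.

```python
def read_block_views(block, projected_attrs, offered_to_indexer):
    """Return (map_view, indexer_view) for one column-oriented block.

    block: dict attribute name -> list of values (all attributes on disk).
    projected_attrs: attributes the job's map function asked for.
    offered_to_indexer: whether this block is proposed for indexing.
    """
    # The map function only ever sees the attributes it projected.
    map_view = {attr: block[attr] for attr in projected_attrs}
    # Only offered blocks pay the extra I/O for the missing attributes.
    indexer_view = dict(block) if offered_to_indexer else None
    return map_view, indexer_view
```

For a job that projects only b and selects on d, the map view holds d and b, while an offered block reaches the indexer with a, b, c, and d so that every attribute can be aligned with the new sort order.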

The reader might think that the invisible projection approach is 
not suitable for data-intensive applications as it still requires extra
I/O operations for loading unprojected attributes. However, this is 
far from the truth, because the extra I/O operations are only for the 
data blocks offered to the Adaptive Indexer. Furthermore, the im- 
pact of invisible projection also depends on the proportion of pro- 
jected attributes. Therefore, if the proportion of projected attributes 
is low, LIAH can always decrease the offer rate in order to keep an 
acceptable adaptive indexing overhead. Alternatively, LIAH could 
also switch to a lazy projection approach to only read the attributes 
required by MapReduce jobs. Even though the current LIAH prototype
supports only the invisible projection technique, we discuss
the lazy projection technique in Appendix A.

6. EXPERIMENTS 

We evaluate the efficiency of LIAH to adapt to query workloads 
and compare it with Hadoop and HAIL. We measure the perfor- 
mance of LIAH with three main objectives in mind: (i) to measure 
the adaptive indexing overhead that LIAH generates over the run- 
time of MapReduce jobs; (ii) to evaluate both how fast LIAH can 
adapt to workloads and how well MapReduce jobs can benefit from 
LIAH; (iii) to study how well each of the adaptive indexing strategies
of LIAH allows MapReduce jobs to improve their runtime.

6.1 Setup 

Cluster. We use two different clusters in our experiments. Our
first cluster (Cluster-A) is a 10-node cluster where each node has:
one 2.66GHz Quad Core Xeon processor; 4x4GB of main memory;
1x750GB SATA hard disk; three 1 Gigabit network cards. Our
second cluster (Cluster-B) is a 4-node cluster where each node has:
one 3.46GHz Hexa Core Xeon X5690 processor; 20GB of main
memory; one 278GB SATA hard disk (for the OS) and one 837GB
SATA hard disk (for HDFS); two 1 Gigabit network cards. We
use Cluster-B to measure the influence of more efficient processors
on the behavior of LIAH. For both Cluster-A and Cluster-B, we use
a 64-bit openSUSE 12.1 OS and the ext3 filesystem.

Datasets. We use the web log dataset (UserVisits) from the
HAIL paper [13]. This dataset has nine attributes, which are mostly
strings, and has a total size of 40GB × numberOfNodes, i.e. 400GB
for Cluster-A and 160GB for Cluster-B. Additionally, we use a
Synthetic dataset containing only numeric attributes, as in scientific
datasets. The Synthetic dataset has six attributes and a
total size of 50GB × numberOfNodes, i.e. 500GB for Cluster-A
and 200GB for Cluster-B. We generate the values for the first attribute
in the range [1..10] and with an exponential repetition for
each value, i.e. $10^{i-1}$ where $i \in [1..10]$. We generate the other five
attributes at random. Then, we shuffle all tuples across the entire
dataset in order to have the same distribution across data blocks.
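The generation of the first Synthetic attribute can be illustrated at a small scale. The sketch reads the description above as value i being repeated 10^(i-1) times before shuffling; the function name, the reduced default range, and the fixed seed are assumptions made so the example stays small and repeatable.

```python
import random
from collections import Counter


def first_attribute_values(max_value=4, seed=0):
    """Generate values 1..max_value where value i occurs 10**(i-1) times."""
    values = [i for i in range(1, max_value + 1) for _ in range(10 ** (i - 1))]
    # Shuffle so that every data block sees the same value distribution.
    random.Random(seed).shuffle(values)
    return values


if __name__ == "__main__":
    print(Counter(first_attribute_values()))
```

With the full range [1..10] the same scheme yields a heavily skewed column in which the largest value dominates, which is what makes range selections on it so selective.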

MapReduce Jobs. For the UserVisits dataset, we consider
eleven jobs (JobUV1 - JobUV11) with a selection predicate on
attribute searchWord and with a full projection (i.e. projecting
all 9 attributes). The first four jobs (JobUV1 - JobUV4) have a
selectivity of 0.4% (1.24 million output records) and the remaining
seven jobs (JobUV5 - JobUV11) have a selectivity of 0.2% (0.62
million output records). For the Synthetic dataset, we consider
another eleven jobs (JobSyn1 - JobSyn11) with a full projection,
but with a selection predicate on the first attribute. These jobs have
a selectivity of 0.2% (2.2 million output records). All MapReduce
jobs for both datasets select disjoint ranges to avoid caching effects.
For all experiments, we report the average of three trials.

Systems. We use Hadoop v0.20.203 and HAIL as baseline systems
to evaluate the benefits of LIAH. For Hadoop, we use the default
configuration settings, but increase the data block size to 256MB
to decrease the scheduling overhead for Hadoop. In our experiments
using the UserVisits dataset, HAIL creates one index
for attribute sourceIP, one for attribute visitDate, and one
for attribute adRevenue, just like in [13]. For our experiments
using the Synthetic dataset, we simply assume that HAIL does
not create an index on the first attribute. For LIAH we consider
four different variants according to the offer rate (p) we use: LIAH
(p = 0.1), LIAH (p = 0.25), LIAH (p = 0.5), and LIAH (p = 1).

Figure 5: LIAH performance when running the first MapReduce
job over Synthetic.

6.2 Performance for the First Job 

Since LIAH piggybacks adaptive indexing on MapReduce jobs, 
the very first question that the reader might ask is: what is the additional
runtime incurred by LIAH on MapReduce jobs? We answer
this question in this section. For this, we run job JobUV1 for
the UserVisits dataset and job JobSyn1 for the Synthetic
dataset. For these experiments, we assume that there is no block
with a relevant index for jobs JobUV1 and JobSyn1.

Figure 4 shows the job runtime for the four variants of LIAH
for the UserVisits dataset. In Cluster-A, we observe that LIAH
has almost no overhead (only 1%) over HAIL when using an offer
rate of 10% (i.e. p = 0.1). Interestingly, we observe that LIAH is
still faster than Hadoop with p = 0.1 and p = 0.25. Indeed, the
overhead incurred by LIAH increases along with the offer rate used
by LIAH. However, we observe that LIAH increases the execution
time of JobUV1 by less than a factor of two w.r.t. both Hadoop and
HAIL, even though all data blocks are indexed in a single MapReduce
job. We especially observe that the overhead incurred by
LIAH scales linearly with the ratio of indexed data blocks (i.e. with
p), except when scaling from p = 0.1 to p = 0.25. This is because
LIAH starts to be CPU bound only when offering more than 20%
of the data blocks (i.e. from p = 0.25). This changes when running
JobUV1 in Cluster-B. In these results, we clearly observe that the
overhead incurred by LIAH scales linearly with p. We especially
observe that LIAH benefits from using newer CPUs and has better
performance than Hadoop for most offer rates. LIAH has only
4% overhead over Hadoop when having p = 1. Additionally, we
can see that LIAH has low overheads w.r.t. HAIL: from 10% (with
p = 0.1) to 43% (with p = 1).

Figure 5 shows the job runtimes for Synthetic. Overall, we 
observe that the overhead incurred by LIAH continues to scale lin- 
early with the offer rate. In particular, we observe that LIAH has no 
overhead over Hadoop in both clusters, except for LIAH (p = 1) 
in Cluster-A (where LIAH incurs a negligible overhead of ~3%).
It is worth noting that when using newer CPUs (Cluster-B) LIAH 
has very low overheads over HAIL as well: from 9% to only 23%. 

From these results, we can conclude that LIAH can efficiently
create indexes at job runtime while limiting the overhead of writing
pseudo data block replicas. In particular, we observe the efficiency
of the lazy indexing mechanism of LIAH to adapt to users'
requirements via different offer rates.

Figure 6: LIAH performance when running a sequence of
MapReduce jobs over UserVisits.

6.3 Performance for a Sequence of Jobs 

We saw in the previous section that LIAH can linearly scale the 
adaptive indexing overhead with the help of the offer rate. But
what are the implications for a sequence of MapReduce jobs? To
answer this question, we run the sequence of eleven MapReduce
jobs for each dataset: JobUV1 - JobUV11 for UserVisits and
JobSyn1 - JobSyn11 for Synthetic.

Figures 6 and 7 show the job runtimes for the UserVisits
and Synthetic datasets, respectively. Overall, we clearly see
in both computing clusters that LIAH improves the performance of
MapReduce jobs linearly with the number of indexed data blocks.
In particular, we observe that the higher the offer rate, the faster
LIAH converges to a complete index. However, the higher the offer
rate, the higher the adaptive indexing overhead for the initial
job (JobUV1 and JobSyn1). Thus, users are faced with a natural
tradeoff between indexing overhead and the required number of
jobs to index all blocks. But it is worth noting that users can use
low offer rates (e.g. p = 0.1) and still quickly converge to a complete
index (e.g. after 10 job executions for p = 0.1). In particular,
we observe that after executing only a few jobs LIAH already
outperforms Hadoop and HAIL significantly. For example, let us
consider the sequence of jobs on Synthetic using p = 0.25 on
Cluster-B. Remember that for this offer rate the overhead for the
first job compared to HAIL is relatively small (11%) while LIAH
is still able to outperform Hadoop. With the second job LIAH is
slightly faster than HAIL and, when running the fourth job, LIAH improves
over HAIL by more than a factor of two and over Hadoop by more
than a factor of five (see footnote 6). As soon as LIAH converges to a complete index,
LIAH significantly outperforms HAIL by up to a factor of 23
and Hadoop by up to a factor of 52. For the UserVisits dataset,
LIAH outperforms HAIL by up to a factor of 24 and Hadoop by
up to a factor of 32. Notice that LIAH and HAIL increase the performance
gap with Hadoop for Synthetic, because they significantly
reduce the size of this dataset when converting it to binary
representation.

In summary, the results show that LIAH can efficiently adapt to 
query workloads with a very low overhead only for the very first 
job: the following jobs always benefit from the indexes created in 
previous jobs. Interestingly, an important result is that LIAH can 
converge to a complete index after running only a few jobs. 
Figure 7: LIAH performance when running a sequence of
MapReduce jobs over Synthetic.

6.4 Eager Adaptive Indexing for a Sequence of Jobs

We saw in the previous section that LIAH improves the perfor- 
mance of MapReduce jobs linearly with the number of indexed data 
blocks. Now, the question that might arise in the reader's mind is: 
can LIAH efficiently exploit the saved runtimes for further adaptive 
indexing? To answer this question, we enable the eager adaptive 
indexing strategy in LIAH and run again all UserVisits jobs 
using an initial offer rate of 10%. In these experiments, we use 
Cluster-A and consider LIAH (without eager adaptive indexing en- 
abled) with offer rates of 10% and 100% as baselines. 

Figure 8 shows the results of this experiment. As expected, we 
observe that LIAH (eager) has the same performance as LIAH 
(p = 0.1) for JobUV1. However, in contrast to LIAH (p = 0.1), 
LIAH (eager) keeps its performance constant for JobUV2. This is 
because LIAH (eager) automatically increases p from 0.1 to 0.17 
in order to exploit saved runtimes. For JobUV3, LIAH (eager) 
still keeps its performance constant by increasing p from 0.17 to 
0.33. Now, even though LIAH (eager) increases p from 0.33 to 1 
for JobUV4, LIAH (eager) now improves the job runtime as only 
40% of the data blocks remain unindexed. As a result of adapting 
its offer rate, LIAH (eager) converges to a complete index after 
only 4 jobs while incurring almost no overhead over HAIL. From 
JobUV5, LIAH (eager) ensures the same performance as LIAH 
(p = 1) since all data blocks are indexed, while LIAH (p = 0.1) 
takes 6 more jobs to converge to a complete index. 
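
The offer rates reported for LIAH (eager) can be checked by 
accumulating them job by job. The sketch below assumes that an offer 
rate p is the fraction of all data blocks indexed by one job; it 
reproduces only the bookkeeping, not the rule LIAH uses to choose p. 
The helper indexed_after_each_job is a hypothetical name. 

```python
from fractions import Fraction


def indexed_after_each_job(offer_rates):
    """Return the cumulative fraction of indexed blocks after each job."""
    indexed = Fraction(0)
    history = []
    for p in offer_rates:
        # Cap at 1: a job cannot index more blocks than exist.
        indexed = min(Fraction(1), indexed + Fraction(p))
        history.append(indexed)
    return history


# Offer rates reported in the text for JobUV1 to JobUV4.
eager_schedule = ["0.1", "0.17", "0.33", "1"]
history = indexed_after_each_job(eager_schedule)
for job, fraction in enumerate(history, start=1):
    print(f"JobUV{job}: {float(fraction):.2f} of blocks indexed")
```

After JobUV3 the cumulative fraction is 0.60, which agrees with the 
statement that 40% of the data blocks remain unindexed when JobUV4 
starts. 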

These results show that LIAH can converge even faster to a com- 
plete index, while still keeping a negligible indexing overhead for 
users' MapReduce jobs. Overall, these results demonstrate the high 
efficiency of LIAH (eager) to adapt its offer rate according to the 
number of already indexed data blocks. 

6.5 Invisible Projection for the First Job 

So far, we have considered MapReduce jobs that project all at- 
tributes. However, this is not always the case in practice. There- 
fore, it is also important to answer the following question: how 
well does LIAH deal with MapReduce jobs that project only a 
subset of attributes from their input datasets? We focus on answering 
this question in this section. To do so, we now enable the invisible 
projection technique from Section 5.3 for LIAH. 

In these experiments, we consider two variants of LIAH: one 
without invisible projection (LIAHwoInvPrj) and another with 
invisible projection (LIAHwInvPrj). We consider a constant offer 
rate of 25% for these two variants of LIAH. For both UserVisits 
and Synthetic, we run a MapReduce job with a selection 
predicate on the first attribute and vary the number of projected 
attributes. Overall, the main goal of these experiments is to 
measure the overhead of LIAHwInvPrj (i.e. reading all attributes) over 
LIAHwoInvPrj (i.e. reading only the required attributes). To 
better evaluate the invisible projection technique, we assume that no 
index exists in UserVisits and Synthetic. We run these 
experiments on Cluster-A. 

Figure 9 shows the results for UserVisits. We observe 
that, when JobUV1 projects only the first attribute, LIAHwInvPrj 
incurs an overhead of almost 45%. This is partially 
because LIAHwInvPrj has to read eight more attributes than 
LIAHwoInvPrj. However, most of this overhead is for reading 
the second attribute (destURL), which is the largest attribute in 
UserVisits. As soon as LIAHwoInvPrj also reads the second 
attribute, LIAHwInvPrj incurs an overhead of only ~19% (i.e. 2x 
less overhead), even though LIAHwInvPrj reads seven more attributes. 
Then, LIAHwInvPrj lowers its overhead by ~2% as the number of 
attributes projected by LIAHwoInvPrj grows. But, as soon as 
LIAHwoInvPrj reads the second largest attribute (i.e. the 5th 
attribute) in UserVisits, LIAHwInvPrj again decreases its 
overhead by roughly a factor of 2, i.e. it now incurs an overhead of only 
9%. From this point, the overhead caused by LIAHwInvPrj starts 
to be negligible. 

We saw in the results for UserVisits that LIAHwInvPrj 
incurs a low overhead overall, especially as soon as LIAHwoInvPrj 
reads the most expensive attributes. Thus, the question that arises 
is: how good is the invisible projection technique for scientific-like 
datasets, where most attributes are of the same size? This is why 
we rerun the invisible projection experiment over the Synthetic 
dataset. 

Figure 10 shows the results for LIAHwInvPrj on the 
Synthetic dataset when running a very selective job (the job 
outputs only one tuple). Notice that we consider a very selective 
job for this experiment in order to clearly see the impact of writing 
adaptively created indexes to disk. We observe that LIAHwInvPrj 
has an acceptable overhead of 18% on average and a low 
overhead when projecting more than half of the total number of 
attributes. It is interesting to highlight that the overhead of 
LIAHwInvPrj over LIAHwoInvPrj is noticeable in these results 
because the MapReduce job outputs only a single tuple. To 
support this claim, we additionally ran a series of experiments with a 
job of very low selectivity that outputs 80% of the incoming tuples. In 
those experiments, we observed that LIAHwInvPrj has a negligible 
6% overhead on average over LIAHwoInvPrj. 

Figure 10: Invisible projection overhead for Synthetic with 
very high job selectivity. 

Additionally, we evaluated LIAHwInvPrj on Cluster-B using an 
offer rate of 25% and the Synthetic dataset, but we do not report 
the results here because of space constraints. Overall, we observed 
in this additional experiment that LIAHwInvPrj incurs a negligible 
overhead of 1% on average over LIAHwoInvPrj. These results also 
showed that LIAH significantly benefits from using newer CPUs. 

In summary, the results we presented in this section demonstrate 
the high efficiency of the invisible projection technique to deal with 
partial projections. 

7. CONCLUSION 

Several research works have improved the performance of 
MapReduce jobs significantly by integrating indexing into the 
MapReduce framework [28, 25, 12, 23, 13]. However, none of 
these indexing techniques can adapt to changes in users' workload 
as they create indexes upfront. Therefore, these indexing tech- 
niques are not suitable for applications where workloads are hard 
to predict, such as in scientific applications and social networks. 

In this paper, we proposed LIAH (for Lazy Indexing and Adap- 
tivity in Hadoop), a parallel, adaptive approach for indexing at min- 
imal costs in MapReduce systems. LIAH creates clustered indexes 
on data blocks as a byproduct of MapReduce job execution. As a 
consequence, LIAH can adapt to changes in users' workloads. The 
beauty of LIAH is that it efficiently piggybacks index creation on 
the existing Hadoop MapReduce pipeline. Hence, LIAH not only 
has no additional read I/O-costs, but it is also completely invisi- 
ble for both the MapReduce system and users. A salient feature of 
LIAH is that, besides distributing indexing effort across multiple 
computing nodes, it also parallelises indexing with map tasks com- 
putation and disk I/O. Furthermore, LIAH can adjust the maximum 
number of data blocks to index in parallel with a single MapRe- 
duce job. In particular, we proposed eager adaptive indexing, a 
technique that allows LIAH to reinvest the runtime benefits from 
indexes created by previous MapReduce jobs for further indexing. 
Thereby, LIAH can trade the number of jobs to complete indexing a 
dataset with early job runtime improvements. This allows LIAH to 
scale indexing effort according to hardware capabilities and users' 
needs. As a result, in contrast to existing adaptive indexing works, 
LIAH incurs very low (or invisible) indexing overheads even for 
the first query that triggers the creation of a new index. Still, LIAH 
quickly converges to a complete index, i.e. all HDFS data blocks 
are indexed. Additionally, we introduced the invisible projection 
and lazy projection techniques, which allow LIAH to efficiently 
create clustered indexes even if incoming jobs project only a subset 
of attributes from their input datasets. 



We experimentally evaluated LIAH and compared it with 
Hadoop and HAIL, using two different datasets (UserVisits 
and Synthetic) and two computing clusters. The results 
demonstrated the clear superiority of LIAH: LIAH runs MapReduce jobs 
up to 52 times faster than Hadoop and up to 24 times faster 
than HAIL. In particular, the results showed that LIAH 
significantly outperforms Hadoop in almost all scenarios, except for the 
very first query when using an offer rate of 100%. With respect 
to HAIL, LIAH has a very low indexing overhead (e.g. 1% for 
UserVisits when using an offer rate of 10%) only for the very 
first job. The following jobs already run faster than HAIL, e.g. ~2 
times faster from the fourth job with an offer rate of 25%. The 
results also showed that, even for low offer rates, LIAH converges to 
a complete index after running only a few MapReduce 
jobs. For example, LIAH converges to a complete index after 10 
jobs with an offer rate of 10%. All this demonstrates the high 
efficiency of LIAH to (i) balance indexing effort, (ii) create clustered 
indexes at job runtime, and (iii) adapt to users' workloads. The 
results also showed that the invisible projection technique incurs a 
negligible overhead (e.g. 6% on average for the Synthetic dataset). 

Acknowledgments. Research partially supported by BMBF. 

APPENDIX 

A. LAZY PROJECTION 

In Section 5.3 we presented invisible projection, a technique for 
efficiently creating clustered indexes even if incoming MapReduce 
jobs project only a subset of attributes. The main idea behind invisible 
projection is to read the unprojected attributes for those data blocks 
that are proposed by map tasks to the Adaptive Indexer. This way 
the Adaptive Indexer can align all attributes inside a data block with 
respect to the indexed attribute. Our results from Section 6.5 show 
that invisible projection incurs a very low overhead on average. 
However, invisible projection might incur higher overheads when 
MapReduce jobs project only a low percentage of attributes. For 
example, we observe in Figure 9 that invisible projection incurs an 
overhead of ~45% when projecting only the first attribute (out of 
eight) from the UserVisits dataset. 

We thus propose lazy projection, a technique that allows LIAH to 
efficiently create clustered indexes when MapReduce jobs project 
only a small proportion of attributes. In contrast to the invisible 
projection technique, the main idea of lazy projection is to read 
only the projected attributes in order to minimise the additional read 
I/O for partial projections. This means that, with the lazy projec- 
tion technique, LIAH reads exclusively the attributes requested by 
users. Thus, map tasks might pass potentially incomplete blocks to 
the Adaptive Indexer. The Adaptive Indexer, in turn, applies sorting 
and reordering only to the available subset of attributes. 

Let us consider again our job example of Section 5.3. Recall 
that job_d filters records based on attribute d and projects only 
attribute b. In this example, using lazy projection, the Index 
Builder first sorts attribute d and thereby creates a permutation 
vector as described in Section 4.2.1. Then, the Index Builder reorders 
the available attribute b according to the permutation vector. 
Finally, the Index Writer creates the corresponding metadata and 
stores the just created partial clustered index as a partially pseudo 
data block replica. In contrast to a pseudo data block replica, a 
partially pseudo data block replica contains the permutation vector of 
attribute d. This permutation vector tells the Index Builder 
how to reorder attributes for alignment w.r.t. the indexed attribute d. 
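
The sort-and-reorder step of the example above can be sketched as 
follows. This is a minimal in-memory illustration, not LIAH's 
implementation: the function build_partial_replica, the dictionary 
layout, and the toy column values are hypothetical, and we assume that 
the sorted index attribute is stored alongside the projected 
attributes. 

```python
def build_partial_replica(block, index_attr, projected_attrs):
    """Sort on index_attr and align only the projected attributes."""
    key_column = block[index_attr]
    # Permutation vector: position i holds the original row id of the
    # i-th smallest value of the index attribute.
    permutation = sorted(range(len(key_column)), key=key_column.__getitem__)
    columns = {index_attr: [key_column[i] for i in permutation]}
    for name in projected_attrs:
        columns[name] = [block[name][i] for i in permutation]
    # The permutation is kept so that attributes missing from this
    # partial replica can be aligned by later jobs.
    return {"columns": columns, "permutation": permutation}


# Toy block in columnar form; attribute c is not projected by the job.
block = {"b": ["x", "y", "z"], "c": [10, 20, 30], "d": [3, 1, 2]}
partial = build_partial_replica(block, "d", ["b"])
print(partial)
```

Attribute c is left out of the partial replica; only the persisted 
permutation records how it would have to be reordered. 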

Now, assume another incoming job, job'_d, with a filter 
condition on attribute d, which also projects attribute c besides 
attribute b. In this case, the LIAH RecordReader uses the 
previously created partially pseudo data block replicas to perform an 
index access on attribute d so as to read only the qualifying 
values from attribute b. Additionally, the LIAH RecordReader reads 
the persisted permutation vector of attribute d. Then, the LIAH 
RecordReader uses the normal data block replica (stored locally) 
to load the missing attribute c. Once attribute c is in main memory, 
the LIAH RecordReader uses the permutation vector to pass the 
qualifying values from c, together with those from b, to the map 
function. When all qualifying records are passed to the map 
function, the LIAH RecordReader then passes the permutation vector 
and all the data from attribute c, which is already in main memory, 
to the Adaptive Indexer. Internally, the Index Builder reorders 
attribute c according to the permutation vector and passes the result 
to the Index Writer. Finally, the Index Writer locates the matching 
partially pseudo data block replica, appends the aligned attribute c, 
and updates the block metadata. In other words, lazy projection 
tolerates missing attributes in partially pseudo data block replicas. 
Lazy projection incrementally completes a partially pseudo data 
block replica whenever the replica (i) has the right index to 
perform an incoming MapReduce job, and (ii) does not contain all the 
attributes projected by the incoming job. Once a partially pseudo 
data block replica is complete, i.e. it contains all attributes, the 
Index Writer simply deletes the permutation vector. 
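
The incremental completion described above can be sketched in the 
same spirit. Again, this is only an illustration under assumptions: 
complete_replica, the dictionary layout, and the toy values are 
hypothetical, and the block is assumed to have just the three 
attributes b, c, and d. 

```python
def complete_replica(partial, block, missing_attr, all_attrs):
    """Append one missing attribute, aligned via the stored permutation."""
    permutation = partial["permutation"]
    # Reorder the newly loaded attribute exactly like the indexed one.
    partial["columns"][missing_attr] = [
        block[missing_attr][i] for i in permutation
    ]
    # Once every attribute is present, the permutation is no longer needed.
    if set(partial["columns"]) == set(all_attrs):
        partial["permutation"] = None
    return partial


# Normal replica (unsorted) and a partial replica indexed on d that so
# far holds only attributes d and b.
block = {"b": ["x", "y", "z"], "c": [10, 20, 30], "d": [3, 1, 2]}
partial = {
    "columns": {"d": [1, 2, 3], "b": ["y", "z", "x"]},
    "permutation": [1, 2, 0],
}
complete_replica(partial, block, "c", all_attrs=["b", "c", "d"])
print(partial)
```

Because the permutation vector is persisted with the partial replica, 
attribute d does not need to be sorted again when attribute c is 
added. 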

Therefore, the advantage of lazy projection is that it allows LIAH 
to fully incrementally adapt its indexes to users' workloads. Fur- 
thermore, lazy projection also allows LIAH to reduce space con- 
sumption, because only those attributes that are actually required 
by MapReduce jobs are stored in pseudo data block replicas. 



